perf: Improve performance of `split_part` by andygrove · Pull Request #19570 · apache/datafusion

andygrove · 2025-12-30T22:02:03Z

Which issue does this PR close?

Closes #.

Rationale for this change

I ran microbenchmarks comparing DataFusion with DuckDB for string functions (see apache/datafusion-benchmarks#26) and noticed that DataFusion was very slow for `split_part`.

This PR fixes some obvious performance issues, chiefly collecting every part of the split into a `Vec` when only one part is needed. Speedups are:

Benchmark	Before	After	Speedup
single_char_delim/pos_first	1.27ms	140µs	9.1x faster
single_char_delim/pos_middle	1.39ms	396µs	3.5x faster
single_char_delim/pos_last	1.47ms	738µs	2.0x faster
single_char_delim/pos_negative	1.35ms	148µs	9.1x faster
multi_char_delim/pos_first	1.22ms	174µs	7.0x faster
multi_char_delim/pos_middle	1.22ms	407µs	3.0x faster
string_view_single_char/pos_first	1.42ms	139µs	10.2x faster
many_parts_20/pos_second	2.48ms	201µs	12.3x faster
long_strings_50_parts/pos_first	8.18ms	178µs	46x faster

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

andygrove · 2025-12-30T22:04:13Z

datafusion/functions/src/string/split_part.rs

        .try_for_each(|((string, delimiter), n)| -> Result<(), DataFusionError> {
            match (string, delimiter, n) {
                (Some(string), Some(delimiter), Some(n)) => {
-                    let split_string: Vec<&str> = string.split(delimiter).collect();


This collected every part into a `Vec` even when only one part was needed.
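The difference between the removed line and the iterator-based approach can be sketched as follows. This is a minimal illustration, not the code from the PR: the function names `nth_part_eager` and `nth_part_lazy` are hypothetical, and the index here is zero-based, unlike the one-based `n` that `split_part` takes.

```rust
/// Eager version: walks the whole input and allocates a Vec holding
/// every part, then indexes into it.
fn nth_part_eager<'a>(s: &'a str, delim: &str, n: usize) -> Option<&'a str> {
    let parts: Vec<&str> = s.split(delim).collect();
    parts.get(n).copied()
}

/// Lazy version: `Iterator::nth` consumes the split iterator only up to
/// the requested part, so no Vec is allocated and the rest of the input
/// is never scanned.
fn nth_part_lazy<'a>(s: &'a str, delim: &str, n: usize) -> Option<&'a str> {
    s.split(delim).nth(n)
}
```

Both functions return the same result. The lazy version does less work the earlier the requested part appears, which is consistent with the `pos_first` cases showing the largest speedups in the table above.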

comphead · 2025-12-30T22:25:31Z

46x faster 👍

comphead

Thanks @andygrove, the early return makes much more sense than eagerly calculating all the parts.

martin-g · 2025-12-31T05:37:23Z

datafusion/functions/src/string/split_part.rs

+                        std::cmp::Ordering::Greater => {
+                            // Positive index: use nth() to avoid collecting all parts
+                            // This stops iteration as soon as we find the nth element
+                            string.split(delimiter).nth((n - 1) as usize)


Are 32-bit systems supported?
`n` is Int64, so it is possible that this cast may lead to a truncation or even a crash in a debug build.

Good catch, thanks. I changed to use try_into with appropriate error handling
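A checked conversion for the positive-index case might look like the sketch below. The merged code is not shown in this thread, so everything here is an assumption: the function name `part_from_start`, the `String` error type, and the use of `usize::try_from` (the reply mentions `try_into`, which is the same conversion seen from the other side). An `as` cast from `i64` to `usize` silently truncates on a 32-bit target, whereas the checked conversion reports the problem.

```rust
use std::convert::TryFrom;

/// Hypothetical helper: returns the n-th part (one-based) counting from
/// the start, or an error if `n` cannot be used as an index.
fn part_from_start<'a>(s: &'a str, delim: &str, n: i64) -> Result<Option<&'a str>, String> {
    if n <= 0 {
        return Err(format!("expected a positive index, got {}", n));
    }
    // `n - 1` cannot overflow here because n >= 1.
    // The conversion fails instead of truncating when the value does not
    // fit in usize, which can only happen on targets narrower than 64 bits.
    let idx = usize::try_from(n - 1)
        .map_err(|_| format!("index {} does not fit in usize on this target", n))?;
    Ok(s.split(delim).nth(idx))
}
```

An index past the last part yields `Ok(None)`, which a caller could map to an empty string.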

martin-g · 2025-12-31T05:41:21Z

datafusion/functions/src/string/split_part.rs

+                        std::cmp::Ordering::Less => {
+                            // Negative index: use rsplit().nth() to efficiently get from the end
+                            // rsplit iterates in reverse, so -1 means first from rsplit (index 0)
+                            string.rsplit(delimiter).nth((-n - 1) as usize)


Another corner case: `-n` overflows for `i64::MIN`.

Good catch, thanks. I changed to use try_into with appropriate error handling
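One way to avoid the negation overflow is sketched below. It is an illustration under assumptions rather than the code that was merged: the name `part_from_end` and the `String` error type are hypothetical, and `i64::unsigned_abs` is one possible choice among several. Negating `i64::MIN` overflows because its magnitude, 2^63, is not representable as an `i64`; `unsigned_abs` returns that magnitude as a `u64` instead.

```rust
use std::convert::TryFrom;

/// Hypothetical helper: returns the part selected by a negative index,
/// where -1 is the last part, -2 the one before it, and so on.
fn part_from_end<'a>(s: &'a str, delim: &str, n: i64) -> Result<Option<&'a str>, String> {
    if n >= 0 {
        return Err(format!("expected a negative index, got {}", n));
    }
    // unsigned_abs() never overflows, even for i64::MIN.
    // Subtracting 1 is safe because the magnitude is at least 1.
    let idx = usize::try_from(n.unsigned_abs() - 1)
        .map_err(|_| format!("index {} does not fit in usize on this target", n))?;
    // rsplit iterates from the end, so index 0 is the last part.
    Ok(s.rsplit(delim).nth(idx))
}
```

With this shape, `i64::MIN` yields either `Ok(None)` or an error depending on the width of `usize`, and does not panic in either case.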

datafusion/functions/src/string/split_part.rs

Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>

andygrove · 2026-01-06T16:39:44Z

Thanks for the reviews @viirya @comphead @martin-g @Jefffrey

andygrove added 2 commits December 30, 2025 15:01

optimize split_part

386176a

optimize split_part

36ea121

github-actions bot added the functions label (Changes to functions implementation) Dec 30, 2025

andygrove changed the title from "optimize split_part" to "perf: Improve performance of split_part" Dec 30, 2025

andygrove commented Dec 30, 2025

cargo fmt

04fe9b4

andygrove marked this pull request as ready for review December 30, 2025 22:06

andygrove added the performance label (Make DataFusion faster) Dec 30, 2025

andygrove requested review from Jefffrey, comphead and viirya December 30, 2025 22:09

comphead approved these changes Dec 30, 2025

viirya approved these changes Dec 30, 2025

Jefffrey approved these changes Dec 31, 2025

martin-g reviewed Dec 31, 2025

andygrove added 2 commits January 5, 2026 03:46

upmerge

36e6e1f

address feedback

0fd7f15

github-actions bot added the physical-expr label (Changes to the physical-expr crates) Jan 5, 2026

andygrove requested a review from martin-g January 5, 2026 10:54

martin-g reviewed Jan 5, 2026

datafusion/functions/src/string/split_part.rs (outdated, resolved)

revert accidental commit

0e09e23

github-actions bot removed the physical-expr label (Changes to the physical-expr crates) Jan 5, 2026

andygrove and others added 2 commits January 5, 2026 09:21

Apply suggestions from code review

bc002fa

Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>

Merge branch 'main' into faster-split-part

4ac1887

andygrove added this pull request to the merge queue Jan 6, 2026

Merged via the queue into apache:main with commit 924037e Jan 6, 2026
31 checks passed

andygrove deleted the faster-split-part branch January 6, 2026 16:39

andygrove merged 8 commits into apache:main from andygrove:faster-split-part

6 participants
